{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DeepDive Tutorial <small>Extracting mentions of spouses from the news</small>\n",
    "\n",
    "In this tutorial, we show an example of a prototypical task that DeepDive is often applied to:\n",
    "extraction of _structured information_ from _unstructured or 'dark' data_ such as web pages, text documents, images, etc.\n",
    "While DeepDive can be used as a more general platform for statistical learning and data processing, most of the tooling described herein has been built for this type of use case, based on our experience of successfully applying DeepDive to [a variety of real-world problems of this type](http://deepdive.stanford.edu/showcase/apps).\n",
    "\n",
    "In this setting, our goal is to take in a set of unstructured (and/or structured) inputs, and populate a relational database table with extracted outputs, along with marginal probabilities for each extraction representing DeepDive's confidence in the extraction.\n",
    "More formally, we write a DeepDive application to extract mentions of _relations_ and their constituent _entities_ or _attributes_, according to a specified schema; this task is often referred to as **_relation extraction_**.*\n",
    "Accordingly, we'll walk through an example scenario where we wish to extract mentions of two people being spouses from news articles.\n",
    "\n",
    "The high-level steps we'll follow are:\n",
    "\n",
    "1. **Data processing.** First, we'll load the raw corpus, add NLP markups, extract a set of _candidate_ relation mentions, and a sparse _feature_ representation of each.\n",
    "\n",
    "2. **Distant supervision with data and rules.** Next, we'll use various strategies to provide _supervision_ for our dataset, so that we can use machine learning to learn the weights of a model.\n",
    "\n",
    "3. **Learning and inference: model specification.** Then, we'll specify the high-level configuration of our _model_.\n",
    "\n",
    "4. **Error analysis and debugging.** Finally, we'll show how to use DeepDive's labeling, error analysis and debugging tools.\n",
    "\n",
    "*_Note the distinction between extraction of true, i.e., factual, relations and extraction of mentions of relations.\n",
    "In this tutorial, we do the latter; however, DeepDive supports further downstream methods for tackling the former task in a principled manner._\n",
    "\n",
    "\n",
    "Whenever something isn't clear, you can always refer to [the complete example code at `examples/spouse/`](https://github.com/HazyResearch/deepdive/tree/master/examples/spouse/) that contains everything shown in this document."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Preparation\n",
    "\n",
    "### 0.0. Installing DeepDive and tweaking notebook\n",
    "First of all, let's make sure DeepDive is installed and can be used from this notebook.\n",
    "See the [DeepDive installation guide](http://deepdive.stanford.edu/installation) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: PATH=/home/jovyan/local/bin:/opt/conda/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\n",
      "env: PATH=/ConfinedWater/deepdive-examples/spouse/deepdive/bin:/opt/conda/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\n",
      "deepdive is /usr/local/bin/deepdive\r\n"
     ]
    }
   ],
   "source": [
    "# PATH needs correct setup to use DeepDive\n",
    "import os; PWD=os.getcwd(); HOME=os.environ[\"HOME\"]; PATH=os.environ[\"PATH\"]\n",
    "# home directory installation\n",
    "%env PATH=$HOME/local/bin:$PATH\n",
    "# notebook-local installation\n",
    "%env PATH=$PWD/deepdive/bin:$PATH\n",
    "\n",
    "!type deepdive\n",
    "no_deepdive_found = !type deepdive >/dev/null\n",
    "if no_deepdive_found: # install it next to this notebook\n",
    "    !bash -c 'PREFIX=\"$PWD\"/deepdive bash <(curl -fsSL git.io/getdeepdive) deepdive_from_release'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to make sure this IPython/Jupyter notebook will work correctly with DeepDive:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# check if notebook kernel was launched in a Unicode locale\n",
    "import locale; LC_CTYPE = locale.getpreferredencoding()\n",
    "if LC_CTYPE != \"UTF-8\":\n",
    "    raise EnvironmentError(\"Notebook is running in '%s' encoding not compatible with DeepDive's Unicode output.\\n\\nPlease restart notebook in a UTF-8 locale with a command like the following:\\n\\n    LC_ALL=en_US.UTF-8 jupyter notebook\" % (LC_CTYPE))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 0.1. Declaring what to predict\n",
    "\n",
    "First, we need to tell DeepDive what we want to predict, as a *random variable*, in a language called *DDlog*. We store this declaration in a file named `app.ddlog`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file app.ddlog\n",
    "## Random variable to predict #################################################\n",
    "\n",
    "# This application's goal is to predict whether a given pair of person mentions\n",
    "# indicates a spouse relationship or not.\n",
    "has_spouse?(\n",
    "    p1_id text,\n",
    "    p2_id text\n",
    ")."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we are going to write our application in this `app.ddlog` one part at a time.\n",
    "We can check whether the code makes sense by asking DeepDive to compile it.\n",
    "DeepDive automatically compiles our application whenever we execute things after making changes, but we can also do this manually by running:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "‘run/compiled’ -> ‘20161105/171411.425517242’\r\n"
     ]
    }
   ],
   "source": [
    "!deepdive compile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 0.2. Setting up a database\n",
    "\n",
    "Next, DeepDive will store all data—input, intermediate, output, etc.—in a relational database.\n",
    "Currently, Postgres and Greenplum are supported.\n",
    "For operating at a larger scale, Greenplum is strongly recommended.\n",
    "To set the location of this database, we need to configure a URL in the [`db.url` file](../examples/spouse/db.url), e.g.:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!echo 'postgresql://'\"${PGHOST:-localhost}\"'/deepdive_spouse_$USER' >db.url"
   ]
  },
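  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of how such a URL breaks down, the sketch below parses one with Python 3's standard library. The user name `alice` stands in for an expanded `$USER` and is purely hypothetical; this only shows the parts of the URL and is not how DeepDive itself reads `db.url`.\n",
    "\n",
    "```python\n",
    "from urllib.parse import urlparse\n",
    "\n",
    "# Hypothetical URL: 'alice' is an assumed user name, not taken from the tutorial.\n",
    "url = urlparse('postgresql://localhost/deepdive_spouse_alice')\n",
    "\n",
    "print(url.scheme)            # database type\n",
    "print(url.hostname)          # host running the database server\n",
    "print(url.path.lstrip('/'))  # database name\n",
    "```\n",
    "\n",
    "The last component is the database name, so changing it is enough to point the application at a different database on the same server."
   ]
  },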
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have no running database yet, the following commands can quickly bring up a new PostgreSQL server to be used with DeepDive, storing all data at `run/database/postgresql` next to this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "no_database_running = !deepdive db is_ready || echo $?\n",
    "if no_database_running:\n",
    "    PGDATA = \"run/database/postgresql\"\n",
    "    !mkdir -p $PGDATA; test -s $PGDATA/PG_VERSION || pg_ctl init -D $PGDATA >/dev/null\n",
    "    !nohup pg_ctl -D $PGDATA -l $PGDATA/logfile start >/dev/null"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Note: DeepDive will drop and then create this database if run from scratch—beware of pointing to an existing populated one!_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "‘run/RUNNING’ -> ‘20161105/171412.988492903’\n",
      "2016-11-05 17:14:13.099080 process/init/app/run.sh\n",
      "‘run/FINISHED’ -> ‘20161105/171412.988492903’\n"
     ]
    }
   ],
   "source": [
    "!deepdive redo init/app"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Data processing\n",
    "\n",
    "In this section, we'll generate the traditional inputs of a statistical learning-type problem: candidate spouse relations, represented by a set of features, which we will aim to classify as _actual_ relation mentions or not.\n",
    "\n",
    "We'll do this in four basic steps:\n",
    "\n",
    "1. Loading raw input data\n",
    "2. Adding NLP markups\n",
    "3. Extracting candidate relation mentions\n",
    "4. Extracting features for each candidate\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1. Loading raw input data\n",
    "Our first task is to download and load the raw text of [a corpus of news articles provided by Signal Media](http://research.signalmedia.co/newsir16/signal-dataset.html) into an `articles` table in our database.\n",
    "\n",
    "For our purposes, it is enough to keep each article's identifier and content in this table.\n",
    "We can tell DeepDive to do this by declaring the schema of this `articles` table in our `app.ddlog` file; we add the following lines:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "## Input Data #################################################################\n",
    "articles(\n",
    "    id      text,\n",
    "    content text\n",
    ")."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DeepDive can use a script's output as a data source for loading data into the table if we follow a simple naming convention.\n",
    "We create a simple shell script at `input/articles.tsj.sh` that outputs the news articles in TSJ format (tab-separated JSONs) from the downloaded corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!mkdir -p input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting input/articles.tsj.sh\n"
     ]
    }
   ],
   "source": [
    "%%file input/articles.tsj.sh\n",
    "#!/usr/bin/env bash\n",
    "set -euo pipefail\n",
    "cd \"$(dirname \"$0\")\"\n",
    "\n",
    "corpus=signalmedia/signalmedia-1m.jsonl\n",
    "[[ -e \"$corpus\" ]] || {\n",
    "    echo \"ERROR: Missing $PWD/$corpus\"\n",
    "    echo \"# Please download it from http://research.signalmedia.co/newsir16/signal-dataset.html\"\n",
    "    echo\n",
    "    echo \"# Alternatively, use our sampled data by running:\"\n",
    "    echo \"deepdive load articles input/articles-100.tsj.bz2\"\n",
    "    echo\n",
    "    echo \"# Or, skipping all NLP markup processes by running:\"\n",
    "    echo \"deepdive create table sentences\"\n",
    "    echo \"deepdive load sentences\"\n",
    "    echo \"deepdive mark done sentences\"\n",
    "    false\n",
    "} >&2\n",
    "\n",
    "cat \"$corpus\" |\n",
    "#grep -E 'wife|husband|married' |\n",
    "#head -100 |\n",
    "jq -r '[.id, .content] | map(@json) | join(\"\\t\")'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to mark the script as executable so DeepDive can actually run it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!chmod +x input/articles.tsj.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The script above reads the corpus (provided as lines of JSON objects), then uses [jq](https://stedolan.github.io/jq/) to extract the fields `id` (the article identifier) and `content` from each entry and format them as TSJ.\n",
    "We can uncomment the `grep` or `head` lines in the pipeline to apply a naive filter that subsamples the articles."
   ]
  },
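  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the TSJ format concrete, here is a minimal Python 3 sketch of the same conversion the `jq` filter performs, together with the reverse step. The function names `to_tsj_line` and `from_tsj_line` and the sample article are ours, for illustration only; they are not part of DeepDive.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "def to_tsj_line(record, fields=('id', 'content')):\n",
    "    # Encode each selected field as JSON, then join the columns with tabs.\n",
    "    return '\\t'.join(json.dumps(record[f]) for f in fields)\n",
    "\n",
    "def from_tsj_line(line):\n",
    "    # Each tab-separated column is an independent JSON value.\n",
    "    return [json.loads(col) for col in line.rstrip('\\n').split('\\t')]\n",
    "\n",
    "# Hypothetical article whose content contains a newline and a tab.\n",
    "article = {'id': 'a1', 'content': 'Line one.\\nTab\\there.'}\n",
    "line = to_tsj_line(article)\n",
    "print(line)\n",
    "print(from_tsj_line(line) == [article['id'], article['content']])\n",
    "```\n",
    "\n",
    "Because JSON encoding escapes tabs and newlines inside string values, every record stays on one line and the only raw tabs are the column separators."
   ]
  },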
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we tell DeepDive to execute the steps to load the `articles` table using the `input/articles.tsj.sh` script.\n",
    "You must have the [full corpus](http://research.signalmedia.co/newsir16/signal-dataset.html) downloaded at `input/signalmedia/signalmedia-1m.jsonl` for the following to finish correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mapp.ddlog: updated since last `deepdive compile`\n",
      "\u001b[0m‘run/compiled’ -> ‘20161105/171415.746380177’\n",
      "‘run/RUNNING’ -> ‘20161105/171416.578993662’\n",
      "2016-11-05 17:14:16.708589 process/init/relation/articles/run.sh\n",
      "\u001b[31m2016-11-05 17:14:16.708348 ################################################################################\n",
      "2016-11-05 17:14:16.708439 # Host: b7ea137f8e52\n",
      "2016-11-05 17:14:16.708456 # DeepDive: deepdive v0.8.0-742-g4b812a1 (Linux x86_64)\n",
      "2016-11-05 17:14:16.708467 export PATH='/usr/local/bin':\"$PATH\"\n",
      "2016-11-05 17:14:16.708477 export DEEPDIVE_PWD='/ConfinedWater/deepdive-examples/spouse'\n",
      "2016-11-05 17:14:16.708486 export DEEPDIVE_APP='/ConfinedWater/deepdive-examples/spouse'\n",
      "2016-11-05 17:14:16.708495 cd \"$DEEPDIVE_APP\"/run\n",
      "2016-11-05 17:14:16.708504 export DEEPDIVE_RUN_ID='20161105/171416.578993662'\n",
      "2016-11-05 17:14:16.708524 # Plan: 20161105/171416.578993662/plan.sh\n",
      "2016-11-05 17:14:16.708535 # Targets: articles\n",
      "2016-11-05 17:14:16.708543 ################################################################################\n",
      "2016-11-05 17:14:16.708551 \n",
      "2016-11-05 17:14:16.708570     # process/init/app/run.sh ####################################### last done: 2016-11-05T17:14:14+0000 (2s ago)\n",
      "2016-11-05 17:14:16.708589 process/init/relation/articles/run.sh ############################### last done: N/A\n",
      "2016-11-05 17:14:16.708599 ++ dirname process/init/relation/articles/run.sh\n",
      "2016-11-05 17:14:16.708615 + cd process/init/relation/articles\n",
      "2016-11-05 17:14:16.708624 + export DEEPDIVE_CURRENT_PROCESS_NAME=process/init/relation/articles\n",
      "2016-11-05 17:14:16.708633 + DEEPDIVE_CURRENT_PROCESS_NAME=process/init/relation/articles\n",
      "2016-11-05 17:14:16.708651 + deepdive create table articles\n",
      "2016-11-05 17:14:17.031058 CREATE TABLE\n",
      "2016-11-05 17:14:17.032028 + deepdive load articles\n",
      "2016-11-05 17:14:17.320186 Loading articles from input/articles.tsj.sh (tsj format)\n",
      "2016-11-05 17:14:17.323291 ERROR: Missing /ConfinedWater/deepdive-examples/spouse/input/signalmedia/signalmedia-1m.jsonl\n",
      "2016-11-05 17:14:17.323435 # Please Download it from http://research.signalmedia.co/newsir16/signal-dataset.html\n",
      "2016-11-05 17:14:17.323478 \n",
      "2016-11-05 17:14:17.323507 # Alternatively, use our sampled data by running:\n",
      "2016-11-05 17:14:17.323537 deepdive load articles input/articles-100.tsv.bz2\n",
      "2016-11-05 17:14:17.323568 \n",
      "2016-11-05 17:14:17.323596 # Or, skipping all NLP markup processes by running:\n",
      "2016-11-05 17:14:17.323624 deepdive create table sentences\n",
      "2016-11-05 17:14:17.323658 deepdive load sentences\n",
      "2016-11-05 17:14:17.323680 deepdive mark done sentences\n",
      "2016-11-05 17:14:17.459342 COPY 0\n",
      "\u001b[0m‘run/ABORTED’ -> ‘20161105/171416.578993662’\n"
     ]
    }
   ],
   "source": [
    "!deepdive redo articles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, a sample of 100 or 1000 articles can be downloaded from GitHub and loaded into DeepDive with the following commands:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100   171  100   171    0     0    352      0 --:--:-- --:--:-- --:--:--   352\n",
      "100  135k  100  135k    0     0   128k      0  0:00:01  0:00:01 --:--:--  574k\n",
      "CREATE TABLE\n",
      "Loading articles from input/articles-100.tsj.bz2 (tsj format)\n",
      "COPY 100\n",
      "ANALYZE\n"
     ]
    }
   ],
   "source": [
    "NUM_ARTICLES = 100\n",
    "ARTICLES_FILE = \"articles-%d.tsj.bz2\" % NUM_ARTICLES\n",
    "\n",
    "articles_not_done = !deepdive done articles || date\n",
    "if articles_not_done:\n",
    "    !cd input && curl -RLO \"https://github.com/HazyResearch/deepdive/raw/master/examples/spouse/input/\"$ARTICLES_FILE\n",
    "    !deepdive reload articles input/$ARTICLES_FILE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After DeepDive finishes creating the table and then fetching and loading the data, we can take a look at the loaded data using the following `deepdive query` command, which enumerates the values for the `id` column of the `articles` table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                  id                  \r\n",
      "--------------------------------------\r\n",
      " ba44d0cd-bff2-4875-8036-86f37419b5e7\r\n",
      " c5f8a528-cc0f-4f3e-aaef-b9e3b6b00325\r\n",
      " 0d07e617-00d4-4866-aee2-0ae197ae366f\r\n",
      " ebcd41ea-e5b4-43a4-9e16-4406d81cfcda\r\n",
      " 7516303b-0db5-477d-9e5d-243a73865e39\r\n",
      " f6e047d0-e409-42a6-ab0e-13ab926719a6\r\n",
      " 15d53efb-2151-4164-aee0-cae51faedeeb\r\n",
      " fe6e8fcc-1128-4410-923d-f05c42174336\r\n",
      " 8b31ede3-0f3b-431a-86a3-342ee18cfd83\r\n",
      " 4336860e-fa87-4f54-b3ce-b4afb72c4acd\r\n",
      "(10 rows)\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!deepdive query '|10 ?- articles(id, _).'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2. Adding NLP markups\n",
    "Next, we'll use Stanford's [CoreNLP](http://stanfordnlp.github.io/CoreNLP/) natural language processing (NLP) system to add useful markups and structure to our input data.\n",
    "This step will split up our articles into sentences and their component _tokens_ (roughly, the words).\n",
    "Additionally, we'll get _lemmas_ (normalized word forms), _part-of-speech (POS) tags_, _named entity recognition (NER) tags_, and a dependency parse of the sentence.\n",
    "\n",
    "Let's first declare the output schema of this step in `app.ddlog`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "## NLP markup #################################################################\n",
    "sentences(\n",
    "    doc_id         text,\n",
    "    sentence_index int,\n",
    "    tokens         json,\n",
    "    lemmas         json,\n",
    "    pos_tags       json,\n",
    "    ner_tags       json,\n",
    "    doc_offsets    json,\n",
    "    dep_types      json,\n",
    "    dep_tokens     json\n",
    ").\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we declare a DDlog function which takes in the `doc_id` and `content` for an article and returns rows conforming to the sentences schema we just declared, using the **user-defined function (UDF)** in `udf/nlp_markup.sh`.\n",
    "We specify that this `nlp_markup` function should be run over each row from `articles`, and the output appended to `sentences`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "function nlp_markup over (\n",
    "        doc_id  text,\n",
    "        content text\n",
    "    ) returns rows like sentences\n",
    "    implementation \"udf/nlp_markup.sh\" handles tsj lines.\n",
    "\n",
    "sentences += nlp_markup(doc_id, content) :-\n",
    "    articles(doc_id, content).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This UDF `udf/nlp_markup.sh` is a Bash script which uses [our own wrapper around CoreNLP](https://github.com/HazyResearch/deepdive/tree/deepdive-corenlp/util/nlp)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!mkdir -p udf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting udf/nlp_markup.sh\n"
     ]
    }
   ],
   "source": [
    "%%file udf/nlp_markup.sh\n",
    "#!/usr/bin/env bash\n",
    "# Parse documents in tab-separated JSONs input stream with CoreNLP\n",
    "#\n",
    "# $ deepdive corenlp install\n",
    "# $ deepdive corenlp start\n",
    "# $ deepdive env udf/nlp_markup.sh\n",
    "# $ deepdive corenlp stop\n",
    "##\n",
    "set -euo pipefail\n",
    "cd \"$(dirname \"$0\")\"\n",
    "\n",
    "# some configuration knobs for CoreNLP\n",
    ": ${CORENLP_PORT:=$(deepdive corenlp unique-port)}  # a CoreNLP server started ahead of time is shared across parallel UDF processes\n",
    "# See: http://stanfordnlp.github.io/CoreNLP/annotators.html\n",
    ": ${CORENLP_ANNOTATORS:=\"\n",
    "        tokenize\n",
    "        ssplit\n",
    "        pos\n",
    "        ner\n",
    "        lemma\n",
    "        depparse\n",
    "    \"}\n",
    "export CORENLP_PORT\n",
    "export CORENLP_ANNOTATORS\n",
    "\n",
    "# make sure CoreNLP server is available\n",
    "deepdive corenlp is-running || {\n",
    "    echo >&2 \"PLEASE MAKE SURE YOU HAVE RUN: deepdive corenlp start\"\n",
    "    false\n",
    "}\n",
    "\n",
    "# parse input with CoreNLP and output a row for every sentence\n",
    "deepdive corenlp parse-tsj docid+ content=nlp -- docid nlp |\n",
    "deepdive corenlp sentences-tsj docid content:nlp \\\n",
    "                            -- docid nlp.{index,tokens.{word,lemma,pos,ner,characterOffsetBegin}} \\\n",
    "                                     nlp.collapsed-dependencies.{dep_type,dep_token}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, we mark it as executable for DeepDive to run it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!chmod +x udf/nlp_markup.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before executing this NLP markup step, we need to launch the CoreNLP server in advance, which may take a while to install and load everything.\n",
    "Note that the CoreNLP library requires Java 8 to run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CoreNLP already installed at /deepdive/lib/stanford-corenlp/corenlp\n",
      "env: CORENLP_JAVAOPTS=-Xmx4g\n",
      "CoreNLP server at CORENLP_PORT=24393 starting...\n",
      "CoreNLP server at CORENLP_PORT=24393 ready.\n",
      "To stop it after final use, run: deepdive corenlp stop\n",
      "To watch its log, run: deepdive corenlp watch-log\n"
     ]
    }
   ],
   "source": [
    "!deepdive corenlp install\n",
    "# If CoreNLP seems to take forever to start, retry after adjusting the Java heap size set on the following line:\n",
    "%env CORENLP_JAVAOPTS=-Xmx4g\n",
    "!deepdive corenlp start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mapp.ddlog: updated since last `deepdive compile`\n",
      "\u001b[0m‘run/compiled’ -> ‘20161105/171516.511996312’\n",
      "‘run/RUNNING’ -> ‘20161105/171518.022314534’\n",
      "2016-11-05 17:15:18.174210 process/ext_sentences_by_nlp_markup/run.sh\n",
      "2016-11-05 17:15:51.256532 deepdive mark 'done' data/sentences\n",
      "‘run/FINISHED’ -> ‘20161105/171518.022314534’\n"
     ]
    }
   ],
   "source": [
    "!deepdive redo sentences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, if we take a look at a sample of the NLP markups, they will have tokens and NER tags that look like the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                doc_id                | index |                                                                                                     tokens                                                                                                     |                                                                ner_tags                                                                 \n",
      "--------------------------------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------\n",
      " 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |     0 | [\"Just\",\"what\",\"Sherlock\",\"needed\",\"after\",\"his\",\"relapse\",\":\",\"to\",\"come\",\"face-to-face\",\"with\",\"his\",\"daddy\",\"issues\",\".\"]                                                                                   | [\"O\",\"O\",\"PERSON\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\"]\n",
      " 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |     1 | [\"When\",\"Elementary\",\"returns\",\"this\",\"fall\",\",\",\"Sherlock\",\"-LRB-\",\"Jonny\",\"Lee\",\"Miller\",\"-RRB-\",\"will\",\"be\",\"dealing\",\"with\",\"the\",\"aftermath\",\"of\",\"his\",\"relapse\",\"in\",\"last\",\"season\",\"'s\",\"finale\",\".\"] | [\"O\",\"O\",\"O\",\"DATE\",\"DATE\",\"O\",\"PERSON\",\"O\",\"PERSON\",\"PERSON\",\"PERSON\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\"]\n",
      " 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |     2 | [\"One\",\"of\",\"the\",\"consequences\",\"?\"]                                                                                                                                                                          | [\"NUMBER\",\"O\",\"O\",\"O\",\"O\"]\n",
      " 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |     3 | [\"His\",\"father\",\"Morland\",\"Holmes\",\",\",\"played\",\"by\",\"John\",\"Noble\",\",\",\"is\",\"coming\",\"to\",\"New\",\"York\",\"to\",\"check\",\"up\",\"on\",\"his\",\"son\",\".\"]                                                                | [\"O\",\"O\",\"PERSON\",\"PERSON\",\"O\",\"O\",\"O\",\"PERSON\",\"PERSON\",\"O\",\"O\",\"O\",\"O\",\"LOCATION\",\"LOCATION\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\"]\n",
      " 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |     4 | [\"Morland\",\"is\",\"an\",\"international\",\"consultant\",\"who\",\"has\",\"a\",\"lot\",\"of\",\"power\",\"and\",\"has\",\"amassed\",\"a\",\"considerable\",\"fortune\",\".\"]                                                                   | [\"PERSON\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\"]\n",
      "(5 rows)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "deepdive query '\n",
    "    doc_id, index, tokens, ner_tags | 5\n",
    "    ?- sentences(doc_id, index, tokens, lemmas, pos_tags, ner_tags, _, _, _).\n",
    "'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3. Extracting candidate relation mentions\n",
    "\n",
    "#### Mentions of people\n",
    "Once again we first declare the schema:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "## Candidate mapping ##########################################################\n",
    "person_mention(\n",
    "    mention_id     text,\n",
    "    mention_text   text,\n",
    "    doc_id         text,\n",
    "    sentence_index int,\n",
    "    begin_index    int,\n",
    "    end_index      int\n",
    ").\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be storing each person as a row referencing a sentence with beginning and ending indexes.\n",
    "Again, we next declare a function that references a UDF and takes as input the sentence tokens and NER tags:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "function map_person_mention over (\n",
    "        doc_id         text,\n",
    "        sentence_index int,\n",
    "        tokens         text[],\n",
    "        ner_tags       text[]\n",
    "    ) returns rows like person_mention\n",
    "    implementation \"udf/map_person_mention.py\" handles tsj lines.\n"
   ]
  },
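  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the `begin_index`/`end_index` layout concrete, here is a small standalone Python 3 sketch that groups runs of `PERSON` tags into spans. The helper name `person_spans` is hypothetical, the tokens and tags are a shortened copy of one of the sample sentences shown earlier, and we treat both indexes as inclusive here.\n",
    "\n",
    "```python\n",
    "from itertools import groupby\n",
    "\n",
    "def person_spans(ner_tags):\n",
    "    # Yield inclusive (begin_index, end_index) pairs for runs of PERSON tags.\n",
    "    position = 0\n",
    "    for tag, run in groupby(ner_tags):\n",
    "        length = len(list(run))\n",
    "        if tag == 'PERSON':\n",
    "            yield position, position + length - 1\n",
    "        position += length\n",
    "\n",
    "tokens = ['His', 'father', 'Morland', 'Holmes', ',', 'played', 'by', 'John', 'Noble', ',']\n",
    "ner_tags = ['O', 'O', 'PERSON', 'PERSON', 'O', 'O', 'O', 'PERSON', 'PERSON', 'O']\n",
    "\n",
    "for begin, end in person_spans(ner_tags):\n",
    "    print(begin, end, ' '.join(tokens[begin:end + 1]))\n",
    "```\n",
    "\n",
    "Each printed line corresponds to one candidate row: the span boundaries plus the mention text recovered by joining the tokens inside the span."
   ]
  },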
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll write a simple UDF in Python that will tag spans of contiguous tokens with the NER tag `PERSON` as person mentions (i.e., we'll essentially rely on CoreNLP's NER module).\n",
    "Note that we've already used a Bash script as a UDF, and indeed any programming language can be used.\n",
    "(DeepDive will just check the path specified in the top line, e.g., `#!/usr/bin/env python`.)\n",
    "However, DeepDive provides some convenient utilities for Python UDFs which handle all IO encoding/decoding.\n",
    "To write our UDF `udf/map_person_mention.py`, we'll start by specifying that our UDF will handle TSJ lines (as specified in the DDlog above).\n",
    "Additionally, we'll specify the exact type schema of both input and output, which DeepDive will check for us:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting udf/map_person_mention.py\n"
     ]
    }
   ],
   "source": [
    "%%file udf/map_person_mention.py\n",
    "#!/usr/bin/env python\n",
    "from deepdive import *\n",
    "\n",
    "@tsj_extractor\n",
    "@returns(lambda\n",
    "        mention_id       = \"text\",\n",
    "        mention_text     = \"text\",\n",
    "        doc_id           = \"text\",\n",
    "        sentence_index   = \"int\",\n",
    "        begin_index      = \"int\",\n",
    "        end_index        = \"int\",\n",
    "    :[])\n",
    "def extract(\n",
    "        doc_id         = \"text\",\n",
    "        sentence_index = \"int\",\n",
    "        tokens         = \"text[]\",\n",
    "        ner_tags       = \"text[]\",\n",
    "    ):\n",
    "    \"\"\"\n",
    "    Finds phrases that are continuous words tagged with PERSON.\n",
    "    \"\"\"\n",
    "    num_tokens = len(ner_tags)\n",
    "    # find all first indexes of series of tokens tagged as PERSON\n",
    "    first_indexes = (i for i in xrange(num_tokens) if ner_tags[i] == \"PERSON\" and (i == 0 or ner_tags[i-1] != \"PERSON\"))\n",
    "    for begin_index in first_indexes:\n",
    "        # find the end of the PERSON phrase (consecutive tokens tagged as PERSON)\n",
    "        end_index = begin_index + 1\n",
    "        while end_index < num_tokens and ner_tags[end_index] == \"PERSON\":\n",
    "            end_index += 1\n",
    "        end_index -= 1\n",
    "        # generate a mention identifier\n",
    "        mention_id = \"%s_%d_%d_%d\" % (doc_id, sentence_index, begin_index, end_index)\n",
    "        mention_text = \" \".join(map(lambda i: tokens[i], xrange(begin_index, end_index + 1)))\n",
    "        # Output a tuple for each PERSON phrase\n",
    "        yield [\n",
    "            mention_id,\n",
    "            mention_text,\n",
    "            doc_id,\n",
    "            sentence_index,\n",
    "            begin_index,\n",
    "            end_index,\n",
    "        ]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!chmod +x udf/map_person_mention.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we write a simple function which extracts and tags all subsequences of tokens having the NER tag \"PERSON\".\n",
    "Note that the `extract` function must be a generator (i.e., use a `yield` statement to return output rows).\n",
    "\n",
    "Finally, we specify that the function will be applied to rows from the `sentences` table and append to the `person_mention` table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "person_mention += map_person_mention(\n",
    "    doc_id, sentence_index, tokens, ner_tags\n",
    ") :-\n",
    "    sentences(doc_id, sentence_index, tokens, _, _, ner_tags, _, _, _).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, to run, just compile and execute as in previous steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mapp.ddlog: updated since last `deepdive compile`\n",
      "\u001b[0m‘run/compiled’ -> ‘20161105/171552.714502406’\n",
      "‘run/RUNNING’ -> ‘20161105/171553.779649233’\n",
      "2016-11-05 17:15:53.959601 process/ext_person_mention_by_map_person_mention/run.sh\n",
      "2016-11-05 17:15:55.686093 deepdive mark 'done' data/person_mention\n",
      "‘run/FINISHED’ -> ‘20161105/171553.779649233’\n"
     ]
    }
   ],
   "source": [
    "!deepdive redo person_mention"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       name       |                 doc                  | sentence | begin | end \n",
      "------------------+--------------------------------------+----------+-------+-----\n",
      " Sherlock         | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        0 |     2 |   2\n",
      " Sherlock         | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        1 |     6 |   6\n",
      " Jonny Lee Miller | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        1 |     8 |  10\n",
      " Morland Holmes   | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        3 |     2 |   3\n",
      " John Noble       | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        3 |     7 |   8\n",
      " Morland          | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        4 |     0 |   0\n",
      " Rob Doherty      | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        5 |    27 |  28\n",
      " Sherlock         | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        8 |     1 |   1\n",
      " Mega Buzz        | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        9 |     6 |   7\n",
      " Holmes           | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |       10 |     5 |   5\n",
      " Morland          | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |       10 |    21 |  21\n",
      " Sherlock         | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |       10 |    27 |  27\n",
      " Tony             | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        0 |    17 |  17\n",
      " Jessie Mueller   | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        0 |    19 |  20\n",
      " Mueller          | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        1 |     0 |   0\n",
      " Abby             | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        1 |     4 |   4\n",
      " Carole King      | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        1 |    14 |  15\n",
      " Abby Mueller     | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        2 |     0 |   1\n",
      " Abby Mueller     | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        4 |    13 |  14\n",
      " Jessie           | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        5 |    11 |  11\n",
      "(20 rows)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "deepdive query '\n",
    "    name, doc, sentence, begin, end | 20\n",
    "    ?- person_mention(p_id, name, doc, sentence, begin, end).\n",
    "'"
   ]
  },
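  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the span-finding logic in isolation, here is a minimal standalone sketch that does not depend on DeepDive.\n",
    "The function name `find_person_spans` and the example sentence are our own illustration and not part of the tutorial's UDF; the sketch assumes the same convention as `extract`, namely inclusive `(begin_index, end_index)` pairs.\n",
    "\n",
    "```python\n",
    "def find_person_spans(ner_tags, tag='PERSON'):\n",
    "    # Return inclusive (begin, end) index pairs of contiguous runs of `tag`.\n",
    "    spans = []\n",
    "    begin = None\n",
    "    for i, t in enumerate(ner_tags):\n",
    "        if t == tag and begin is None:\n",
    "            begin = i                      # a new run starts here\n",
    "        elif t != tag and begin is not None:\n",
    "            spans.append((begin, i - 1))   # the run ended at the previous token\n",
    "            begin = None\n",
    "    if begin is not None:                  # a run that reaches the end of the sentence\n",
    "        spans.append((begin, len(ner_tags) - 1))\n",
    "    return spans\n",
    "\n",
    "# Hypothetical sentence, made up for illustration\n",
    "tokens   = ['Barack', 'Obama', 'and', 'Michelle', 'smiled', '.']\n",
    "ner_tags = ['PERSON', 'PERSON', 'O', 'PERSON', 'O', 'O']\n",
    "for begin, end in find_person_spans(ner_tags):\n",
    "    print(begin, end, ' '.join(tokens[begin:end + 1]))\n",
    "# prints:\n",
    "# 0 1 Barack Obama\n",
    "# 3 3 Michelle\n",
    "```\n",
    "\n",
    "A single pass with an explicit `begin` marker is one way to avoid the inner `while` loop; it should find the same spans as the UDF."
   ]
  },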
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mentions of spouses (pairs of people)\n",
    "Next, we'll take all pairs of **non-overlapping person mentions that co-occur in a sentence with less than 5 people total,** and consider these as the set of potential ('candidate') spouse mentions.\n",
    "We thus filter out sentences with large numbers of people for the purposes of this tutorial; however, these could be included if desired.\n",
    "Again, to start, we declare the schema for our `spouse_candidate` table—here just the two names, and the two `person_mention` IDs referred to:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "spouse_candidate(\n",
    "    p1_id   text,\n",
    "    p1_name text,\n",
    "    p2_id   text,\n",
    "    p2_name text\n",
    ").\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, for this operation we don't use any UDF script, instead rely entirely on DDlog operations.\n",
    "We simply construct a table of person counts, and then do a join with our filtering conditions.\n",
    "In DDlog this looks like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "num_people(doc_id, sentence_index, COUNT(p)) :-\n",
    "    person_mention(p, _, doc_id, sentence_index, _, _).\n",
    "\n",
    "spouse_candidate(p1, p1_name, p2, p2_name) :-\n",
    "    num_people(same_doc, same_sentence, num_p),\n",
    "    person_mention(p1, p1_name, same_doc, same_sentence, p1_begin, _),\n",
    "    person_mention(p2, p2_name, same_doc, same_sentence, p2_begin, _),\n",
    "    num_p < 5,\n",
    "    p1 < p2,\n",
    "    p1_name != p2_name,\n",
    "    p1_begin != p2_begin.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's tell DeepDive to run what we have so far:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mapp.ddlog: updated since last `deepdive compile`\n",
      "\u001b[0m‘run/compiled’ -> ‘20161105/171556.664236290’\n",
      "‘run/RUNNING’ -> ‘20161105/171557.725553271’\n",
      "2016-11-05 17:15:57.944202 process/ext_num_people/run.sh\n",
      "2016-11-05 17:15:58.155046 deepdive mark 'done' data/num_people\n",
      "2016-11-05 17:15:58.185821 process/ext_spouse_candidate/run.sh\n",
      "2016-11-05 17:15:58.377589 deepdive mark 'done' data/spouse_candidate\n",
      "‘run/FINISHED’ -> ‘20161105/171557.725553271’\n"
     ]
    }
   ],
   "source": [
    "!deepdive redo spouse_candidate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       name1       |       name2       |                 doc                  | sentence \n",
      "-------------------+-------------------+--------------------------------------+----------\n",
      " Sherlock          | Jonny Lee Miller  | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        1\n",
      " Morland Holmes    | John Noble        | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |        3\n",
      " Sherlock          | Holmes            | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |       10\n",
      " Morland           | Holmes            | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |       10\n",
      " Morland           | Sherlock          | 8b31ede3-0f3b-431a-86a3-342ee18cfd83 |       10\n",
      " Tony              | Jessie Mueller    | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        0\n",
      " Carole King       | Abby              | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        1\n",
      " Mueller           | Abby              | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        1\n",
      " Mueller           | Carole King       | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        1\n",
      " Mueller           | Abby Mueller      | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        7\n",
      " Jessie            | Abby Mueller      | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        7\n",
      " Jessie            | Mueller           | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        7\n",
      " Jill Shellabarger | Matt              | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        8\n",
      " Roger Mueller     | Matt              | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        8\n",
      " Jill Shellabarger | Andrew            | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        8\n",
      " Roger Mueller     | Andrew            | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        8\n",
      " Matt              | Andrew            | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        8\n",
      " Roger Mueller     | Jill Shellabarger | 9b28e780-ba48-4a53-8682-7c58c141a1b6 |        8\n",
      " Khoury            | Greg Medcraft     | ebcd41ea-e5b4-43a4-9e16-4406d81cfcda |       34\n",
      " Dame Joan Collins | Jackie            | df13cc43-53fd-4f09-9a7e-d69b12a4adc0 |        0\n",
      "(20 rows)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "deepdive query '\n",
    "    name1, name2, doc, sentence | 20\n",
    "    ?- spouse_candidate(p1, name1, p2, name2),\n",
    "       person_mention(p1, _, doc, sentence, _, _).\n",
    "'"
   ]
  },
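  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For readers more familiar with Python than with DDlog, the candidate-generation rule can be read roughly as the following sketch.\n",
    "It is only an illustration of the join and filter conditions, not how DeepDive executes the rule (DeepDive compiles DDlog to database queries); the function name `spouse_candidates`, the tuple layout, and the mention IDs are assumptions made up for this example.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def spouse_candidates(mentions, max_people=5):\n",
    "    # mentions: list of (mention_id, name, doc_id, sentence_index, begin_index)\n",
    "    # Count mentions per sentence, like the num_people relation.\n",
    "    num_people = Counter((doc, sent) for _, _, doc, sent, _ in mentions)\n",
    "    pairs = []\n",
    "    for p1, n1, d1, s1, b1 in mentions:\n",
    "        for p2, n2, d2, s2, b2 in mentions:\n",
    "            same_sentence = (d1, s1) == (d2, s2)\n",
    "            if (same_sentence\n",
    "                    and num_people[(d1, s1)] < max_people  # num_p < 5\n",
    "                    and p1 < p2        # keep each unordered pair only once\n",
    "                    and n1 != n2       # p1_name != p2_name\n",
    "                    and b1 != b2):     # p1_begin != p2_begin\n",
    "                pairs.append((p1, n1, p2, n2))\n",
    "    return pairs\n",
    "\n",
    "# Hypothetical mentions, made up for illustration\n",
    "mentions = [\n",
    "    ('d1_0_0_1', 'Abby Mueller', 'd1', 0, 0),\n",
    "    ('d1_0_3_3', 'Jessie',       'd1', 0, 3),\n",
    "    ('d1_1_2_2', 'Jessie',       'd1', 1, 2),\n",
    "]\n",
    "print(spouse_candidates(mentions))\n",
    "# prints: [('d1_0_0_1', 'Abby Mueller', 'd1_0_3_3', 'Jessie')]\n",
    "```\n",
    "\n",
    "The nested loop is quadratic in the number of mentions and is written this way only for clarity; the database join used by DeepDive restricts the comparison to mentions that share a sentence."
   ]
  },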
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.4. Extracting features for each candidate\n",
    "Finally, we will extract a set of **features** for each candidate:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "## Feature Extraction #########################################################\n",
    " \n",
    "# Feature extraction (using DDLIB via a UDF) at the relation level\n",
    "spouse_feature(\n",
    "    p1_id   text,\n",
    "    p2_id   text,\n",
    "    feature text\n",
    ").\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal here is to represent each spouse candidate mention by a set of attributes or **_features_** which capture at least the key aspects of the mention, and then let a machine learning model learn how much each feature is correlated with our decision variable ('is this a spouse mention?').\n",
    "For those who have worked with machine learning systems before, note that we are using a sparse storage representation-\n",
    "you could think of a spouse candidate `(p1_id, p2_id)` as being represented by a vector of length `L = COUNT(DISTINCT feature)`, consisting of all zeros except for at the indexes specified by the rows with key `(p1_id, p2_id)`.\n",
    "\n",
    "DeepDive includes an [automatic feature generation library, DDlib](http://deepdive.stanford.edu/gen_feats), which we will use here.\n",
    "Although many state-of-the-art [applications](http://deepdive.stanford.edu/showcase/apps) have been built using purely DDlib-generated features, others can be used and/or added as well.\n",
    "To use DDlib, we create a list of `ddlib.Word` objects, two `ddlib.Span` objects, and then use the function `get_generic_features_relation`, as shown in the following Python code for `udf/extract_spouse_features.py`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting udf/extract_spouse_features.py\n"
     ]
    }
   ],
   "source": [
    "%%file udf/extract_spouse_features.py\n",
    "#!/usr/bin/env python\n",
    "from deepdive import *\n",
    "import ddlib\n",
    "\n",
    "@tsj_extractor\n",
    "@returns(lambda\n",
    "        p1_id   = \"text\",\n",
    "        p2_id   = \"text\",\n",
    "        feature = \"text\",\n",
    "    :[])\n",
    "def extract(\n",
    "        p1_id          = \"text\",\n",
    "        p2_id          = \"text\",\n",
    "        p1_begin_index = \"int\",\n",
    "        p1_end_index   = \"int\",\n",
    "        p2_begin_index = \"int\",\n",
    "        p2_end_index   = \"int\",\n",
    "        doc_id         = \"text\",\n",
    "        sent_index     = \"int\",\n",
    "        tokens         = \"text[]\",\n",
    "        lemmas         = \"text[]\",\n",
    "        pos_tags       = \"text[]\",\n",
    "        ner_tags       = \"text[]\",\n",
    "        dep_types      = \"text[]\",\n",
    "        dep_parents    = \"int[]\",\n",
    "    ):\n",
    "    \"\"\"\n",
    "    Uses DDLIB to generate features for the spouse relation.\n",
    "    \"\"\"\n",
    "    # Create a DDLIB sentence object, which is just a list of DDLIB Word objects\n",
    "    sent = []\n",
    "    for i,t in enumerate(tokens):\n",
    "        sent.append(ddlib.Word(\n",
    "            begin_char_offset=None,\n",
    "            end_char_offset=None,\n",
    "            word=t,\n",
    "            lemma=lemmas[i],\n",
    "            pos=pos_tags[i],\n",
    "            ner=ner_tags[i],\n",
    "            dep_par=dep_parents[i] - 1,  # Note that as stored from CoreNLP 0 is ROOT, but for DDLIB -1 is ROOT\n",
    "            dep_label=dep_types[i]))\n",
    "\n",
    "    # Create DDLIB Spans for the two person mentions\n",
    "    p1_span = ddlib.Span(begin_word_id=p1_begin_index, length=(p1_end_index-p1_begin_index+1))\n",
    "    p2_span = ddlib.Span(begin_word_id=p2_begin_index, length=(p2_end_index-p2_begin_index+1))\n",
    "\n",
    "    # Generate the generic features using DDLIB\n",
    "    for feature in ddlib.get_generic_features_relation(sent, p1_span, p2_span):\n",
    "        yield [p1_id, p2_id, feature]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!chmod +x udf/extract_spouse_features.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that getting the input for this UDF requires joining the `person_mention` and `sentences` tables:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "function extract_spouse_features over (\n",
    "        p1_id          text,\n",
    "        p2_id          text,\n",
    "        p1_begin_index int,\n",
    "        p1_end_index   int,\n",
    "        p2_begin_index int,\n",
    "        p2_end_index   int,\n",
    "        doc_id         text,\n",
    "        sent_index     int,\n",
    "        tokens         text[],\n",
    "        lemmas         text[],\n",
    "        pos_tags       text[],\n",
    "        ner_tags       text[],\n",
    "        dep_types      text[],\n",
    "        dep_tokens     int[]\n",
    "    ) returns rows like spouse_feature\n",
    "    implementation \"udf/extract_spouse_features.py\" handles tsj lines.\n",
    "\n",
    "spouse_feature += extract_spouse_features(\n",
    "    p1_id, p2_id, p1_begin_index, p1_end_index, p2_begin_index, p2_end_index,\n",
    "    doc_id, sent_index, tokens, lemmas, pos_tags, ner_tags, dep_types, dep_tokens\n",
    ") :-\n",
    "    person_mention(p1_id, _, doc_id, sent_index, p1_begin_index, p1_end_index),\n",
    "    person_mention(p2_id, _, doc_id, sent_index, p2_begin_index, p2_end_index),\n",
    "    sentences(doc_id, sent_index, tokens, lemmas, pos_tags, ner_tags, _, dep_types, dep_tokens).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's execute this UDF to get our features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mapp.ddlog: updated since last `deepdive compile`\n",
      "\u001b[0m‘run/compiled’ -> ‘20161105/171559.768170292’\n",
      "‘run/RUNNING’ -> ‘20161105/171600.894283335’\n",
      "2016-11-05 17:16:01.100184 process/ext_spouse_feature_by_extract_spouse_features/run.sh\n",
      "2016-11-05 17:16:11.115510 deepdive mark 'done' data/spouse_feature\n",
      "‘run/FINISHED’ -> ‘20161105/171600.894283335’\n"
     ]
    }
   ],
   "source": [
    "!deepdive redo spouse_feature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we take a look at a sample of the extracted features, they will look roughly like the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                                    feature                                                     \r\n",
      "----------------------------------------------------------------------------------------------------------------\r\n",
      " WORD_SEQ_[will try to apply those skills to his son remains to be seen , but Morland will stick his nose into]\r\n",
      " LEMMA_SEQ_[will try to apply those skill to he son remain to be see , but Morland will stick he nose into]\r\n",
      " NER_SEQ_[O O O O O O O O O O O O O O O PERSON O O O O O]\r\n",
      " POS_SEQ_[MD VB TO VB DT NNS TO PRP$ NN VBZ TO VB VBN , CC NNP MD VB PRP$ NN IN]\r\n",
      " W_LEMMA_L_1_R_1_[elder]_['s]\r\n",
      " W_NER_L_1_R_1_[O]_[O]\r\n",
      " W_LEMMA_L_1_R_2_[elder]_['s first]\r\n",
      " W_NER_L_1_R_2_[O]_[O ORDINAL]\r\n",
      " W_LEMMA_L_1_R_3_[elder]_['s first case]\r\n",
      " W_NER_L_1_R_3_[O]_[O ORDINAL O]\r\n",
      " W_LEMMA_L_2_R_1_[the elder]_['s]\r\n",
      " W_NER_L_2_R_1_[O O]_[O]\r\n",
      " W_LEMMA_L_2_R_2_[the elder]_['s first]\r\n",
      " W_NER_L_2_R_2_[O O]_[O ORDINAL]\r\n",
      " W_LEMMA_L_2_R_3_[the elder]_['s first case]\r\n",
      " W_NER_L_2_R_3_[O O]_[O ORDINAL O]\r\n",
      " W_LEMMA_L_3_R_1_[not the elder]_['s]\r\n",
      " W_NER_L_3_R_1_[O O O]_[O]\r\n",
      " W_LEMMA_L_3_R_2_[not the elder]_['s first]\r\n",
      " W_NER_L_3_R_2_[O O O]_[O ORDINAL]\r\n",
      "(20 rows)\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!deepdive query '| 20 ?- spouse_feature(_, _, feature).'"
   ]
  },
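  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sparse storage representation can be made concrete with a small sketch that expands `(p1_id, p2_id, feature)` rows into dense 0/1 vectors of length `L = COUNT(DISTINCT feature)`.\n",
    "This is purely illustrative: DeepDive does not require you to build such vectors, and the function name `to_indicator_vectors`, the mention IDs, and the sample rows are assumptions made up for this example.\n",
    "\n",
    "```python\n",
    "def to_indicator_vectors(rows):\n",
    "    # rows: list of (p1_id, p2_id, feature) tuples, shaped like spouse_feature\n",
    "    vocabulary = sorted({feature for _, _, feature in rows})\n",
    "    index = {feature: i for i, feature in enumerate(vocabulary)}\n",
    "    vectors = {}\n",
    "    for p1_id, p2_id, feature in rows:\n",
    "        # Start from all zeros and switch on the positions that have a row.\n",
    "        vec = vectors.setdefault((p1_id, p2_id), [0] * len(vocabulary))\n",
    "        vec[index[feature]] = 1\n",
    "    return vocabulary, vectors\n",
    "\n",
    "# Hypothetical rows, made up for illustration\n",
    "rows = [\n",
    "    ('m1', 'm2', 'NER_SEQ_[O]'),\n",
    "    ('m1', 'm2', 'W_NER_L_1_R_1_[O]_[O]'),\n",
    "    ('m3', 'm4', 'NER_SEQ_[O]'),\n",
    "]\n",
    "vocabulary, vectors = to_indicator_vectors(rows)\n",
    "print(vocabulary)            # ['NER_SEQ_[O]', 'W_NER_L_1_R_1_[O]_[O]']\n",
    "print(vectors[('m1', 'm2')]) # [1, 1]\n",
    "print(vectors[('m3', 'm4')]) # [1, 0]\n",
    "```\n",
    "\n",
    "With many distinct features, almost every entry of such a vector is zero, which is why storing only the rows that are present is the more economical choice."
   ]
  },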
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have generated what looks more like the standard input to a machine learning problem—a set of objects, represented by sets of features, which we want to classify (here, as true or false mentions of a spousal relation).\n",
    "However, we **don't have any supervised labels** (i.e., a set of correct answers) for a machine learning algorithm to learn from!\n",
    "In most real world applications, a sufficiently large set of supervised labels is _not_ available.\n",
    "With DeepDive, we take the approach sometimes referred to as _distant supervision_ or _data programming_, where we instead generate a **noisy set of labels using a mix of mappings from secondary datasets and other heuristic rules**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Distant supervision with data and rules\n",
    "\n",
    "In this section, we'll use _distant supervision_ (or '_data programming_') to provide a noisy set of labels for candidate relation mentions, with which we will train a machine learning model.\n",
    "\n",
    "We'll describe two basic categories of approaches:\n",
    "\n",
    "1. Mapping from secondary data for distant supervision\n",
    "2. Using heuristic rules for distant supervision\n",
    "\n",
    "Then, we'll describe a simple majority-vote approach to resolving multiple labels per example, which can be implemented within DDlog."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's declare a new table where we'll store the labels (referring to the spouse candidate mentions), with an integer value (`True=1, False=-1`) and a description (`rule_id`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "## Distant Supervision ########################################################\n",
    "spouse_label(\n",
    "    p1_id   text,\n",
    "    p2_id   text,\n",
    "    label   int,\n",
    "    rule_id text\n",
    ").\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's put all the spouse candidate mentions with a `NULL` label.  This is just for simplifying some steps later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "# make sure all pairs in spouse_candidate are considered as unsupervised examples\n",
    "spouse_label(p1,p2, 0, NULL) :-\n",
    "    spouse_candidate(p1, _, p2, _).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1. Mapping from secondary data for distant supervision\n",
    "First, we'll try using an external structured dataset of known married couples, from [DBpedia](http://wiki.dbpedia.org/), to distantly supervise our dataset.\n",
    "We'll download the relevant data, and then map it to our candidate spouse relations.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Extracting and downloading the DBpedia data\n",
    "Our goal is to first extract a collection of known married couples from DBpedia and then load this into the `spouses_dbpedia` table in our database.\n",
    "To extract known married couples, we use the DBpedia dump present in [Google's BigQuery platform](https://bigquery.cloud.google.com).\n",
    "First we extract the URI, name and spouse information from the DBpedia `person` table records in BigQuery for which the field `name` is not NULL.\n",
    "We use the following query:\n",
    "\n",
    "```sql\n",
    "SELECT URI,name, spouse\n",
    "FROM [fh-bigquery:dbpedia.person]\n",
    "where name <> \"NULL\"\n",
    "```\n",
    "\n",
    "We store the result of the above query in a local project table `dbpedia.validnames` and perform a self-join to obtain the pairs of married couples.\n",
    "\n",
    "```sql\n",
    "SELECT t1.name, t2.name\n",
    "FROM [dbpedia.validnames] AS t1\n",
    "JOIN EACH [dbpedia.validnames] AS t2\n",
    "ON t1.spouse = t2.URI\n",
    "```\n",
    "\n",
    "The output of the above query is stored in a new table named `dbpedia.spouseraw`.\n",
    "Finally, we use the following query to remove symmetric duplicates.\n",
    "\n",
    "```sql\n",
    "SELECT p1, p2\n",
    "FROM (SELECT t1_name as p1, t2_name as p2 FROM [dbpedia.spouseraw]),\n",
    "     (SELECT t2_name as p1, t1_name as p2 FROM [dbpedia.spouseraw])\n",
    "WHERE p1 < p2\n",
    "```\n",
    "\n",
    "The output of this query is stored in a local file.\n",
    "The file contains duplicate rows (BigQuery does not support `distinct`).\n",
    "It also contains noisy rows where the name field contains a string where the given name family name and multiple aliases were concatenated and reported in a string including the characters `{` and `}`.\n",
    "Using the Unix commands `sed`, `sort` and `uniq` we first remove the lines containing characters `{` and `}` and then duplicate entries.\n",
    "This results in an input file `spouses_dbpedia.csv` containing 6,126 entries of married couples.\n",
    "\n",
    "*Note that we made this [`spouses_dbpedia.csv` available for download from GitHub](https://github.com/HazyResearch/deepdive/blob/master/examples/spouse/input/spouses_dbpedia.csv.bz2), so you don't have to repeat the above process.*"
   ]
  },
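  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cleanup step performed with `sed`, `sort`, and `uniq` could equally be sketched in Python, as below.\n",
    "The exact shell commands are not given in this tutorial, so this is an assumed equivalent rather than the original procedure; the function name `clean_couples`, the simple `name1,name2` line layout, and the noisy sample row are made up for illustration.\n",
    "\n",
    "```python\n",
    "def clean_couples(lines):\n",
    "    # lines: iterable of 'name1,name2' strings (assumed layout, no CSV quoting)\n",
    "    couples = set()\n",
    "    for line in lines:\n",
    "        line = line.strip()\n",
    "        if not line or '{' in line or '}' in line:\n",
    "            continue          # drop blank lines and noisy alias rows\n",
    "        couples.add(line)     # the set discards exact duplicates\n",
    "    return sorted(couples)\n",
    "\n",
    "raw = [\n",
    "    'Aamir Khan,Kiran Rao',\n",
    "    'Aamir Khan,Kiran Rao',              # duplicate row\n",
    "    '{Abbe Lane|Abbe},Xavier Cugat',     # hypothetical noisy row\n",
    "    'Abbe Lane,Xavier Cugat',\n",
    "]\n",
    "print(clean_couples(raw))\n",
    "# prints: ['Aamir Khan,Kiran Rao', 'Abbe Lane,Xavier Cugat']\n",
    "```"
   ]
  },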
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Loading DBpedia data to database\n",
    "\n",
    "To load the known married couples data into DeepDive, we first declare the schema in DDlog:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "# distant supervision using data from DBpedia\n",
    "\n",
    "spouses_dbpedia(\n",
    "    person1_name text,\n",
    "    person2_name text\n",
    ")."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that we can easily load the data in `spouses_dbpedia.csv` data to the table we just declared if we follow DeepDive's convention of organizing input data under `input/` directory.\n",
    "The input file name simply needs to start with the target database table name.\n",
    "Let's download the file from GitHub to `input/spouses_dbpedia.csv.bz2` under our application:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100   174  100   174    0     0    360      0 --:--:-- --:--:-- --:--:--   360\n",
      "100 77463  100 77463    0     0  82313      0 --:--:-- --:--:-- --:--:-- 82313\n"
     ]
    }
   ],
   "source": [
    "!cd input && curl -RLO \"https://github.com/HazyResearch/deepdive/raw/master/examples/spouse/input/spouses_dbpedia.csv.bz2\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then execute this command to load it into the database:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mapp.ddlog: updated since last `deepdive compile`\n",
      "\u001b[0m‘run/compiled’ -> ‘20161105/171613.590881475’\n",
      "‘run/RUNNING’ -> ‘20161105/171614.696781763’\n",
      "2016-11-05 17:16:14.842279 process/init/relation/spouses_dbpedia/run.sh\n",
      "2016-11-05 17:16:15.643122 deepdive mark 'done' data/spouses_dbpedia\n",
      "‘run/FINISHED’ -> ‘20161105/171614.696781763’\n"
     ]
    }
   ],
   "source": [
    "!deepdive redo spouses_dbpedia"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the database should include tuples that look like the following:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        name1         |                  name2                  \r\n",
      "----------------------+-----------------------------------------\r\n",
      " 20th Earl of Arundel | Anne Howard Countess of Arundel\r\n",
      " Aadesh Shrivastava   | Vijayta Pandit\r\n",
      " Aafia Siddiqui       | Amjad Mohammed Khan\r\n",
      " A. A. Gill           | Amber Rudd\r\n",
      " Aamir Ali Malik      | Sanjeeda Shaikh\r\n",
      " Aamir Khan           | Kiran Rao\r\n",
      " Aarón Díaz           | Kate del Castillo\r\n",
      " Aaron Hotchner       | Beth Clemmons\r\n",
      " Aaron Spelling       | Carolyn Jones\r\n",
      " Aaron Staton         | Connie Fletcher\r\n",
      " Aarti Bajaj          | Anurag Kashyap\r\n",
      " Abbas                | Erum Ali\r\n",
      " Abbas Tyrewala       | Pakhi Tyrewala\r\n",
      " Abbe Lane            | Xavier Cugat\r\n",
      " Abbie G. Rogers      | Henry Huttleston Rogers\r\n",
      " Abby Jimenez         | Ramon Jimenez Jr.\r\n",
      " Abby Lockhart        | Luka Kovač\r\n",
      " Abby McDeere         | Mitch McDeere\r\n",
      " Abdel Hakim Amer     | Berlenti Abdul Hamid  برلنتي عبد الحميد\r\n",
      " Abdoulaye Wade       | Viviane Wade\r\n",
      "(20 rows)\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!deepdive query '| 20 ?- spouses_dbpedia(name1, name2).'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Supervising spouse candidates with DBpedia data\n",
    "\n",
    "Next we'll implement a simple distant supervision rule which labels any spouse mention candidate with a pair of names appearing in DBpedia as true:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "spouse_label(p1,p2, 1, \"from_dbpedia\") :-\n",
    "    spouse_candidate(p1, p1_name, p2, p2_name),\n",
    "    spouses_dbpedia(n1, n2),\n",
    "    [ lower(n1) = lower(p1_name), lower(n2) = lower(p2_name) ;\n",
    "      lower(n2) = lower(p1_name), lower(n1) = lower(p2_name) ].\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It should be noted that there are many clear ways in which this rule could be improved (fuzzy matching, more restrictive conditions, etc.), but this serves as an example of one major type of distant supervision rule."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2. Using heuristic rules for distant supervision\n",
    "We can also create a supervision rule which does not rely on any secondary structured dataset like DBpedia, but instead just uses some heuristic.\n",
    "We set up a DDlog function, `supervise`, which uses a UDF containing several heuristic rules over the mention and sentence attributes:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "# supervision by heuristic rules in a UDF\n",
    "function supervise over (\n",
    "        p1_id text, p1_begin int, p1_end int,\n",
    "        p2_id text, p2_begin int, p2_end int,\n",
    "        doc_id         text,\n",
    "        sentence_index int,\n",
    "        sentence_text  text,\n",
    "        tokens         text[],\n",
    "        lemmas         text[],\n",
    "        pos_tags       text[],\n",
    "        ner_tags       text[],\n",
    "        dep_types      text[],\n",
    "        dep_tokens     int[]\n",
    "    ) returns (\n",
    "        p1_id text, p2_id text, label int, rule_id text\n",
    "    )\n",
    "    implementation \"udf/supervise_spouse.py\" handles tsj lines.\n",
    "\n",
    "spouse_label += supervise(\n",
    "    p1_id, p1_begin, p1_end,\n",
    "    p2_id, p2_begin, p2_end,\n",
    "    doc_id, sentence_index,\n",
    "    tokens, lemmas, pos_tags, ner_tags, dep_types, dep_token_indexes\n",
    ") :-\n",
    "    spouse_candidate(p1_id, _, p2_id, _),\n",
    "    person_mention(p1_id, p1_text, doc_id, sentence_index, p1_begin, p1_end),\n",
    "    person_mention(p2_id, p2_text,      _,              _, p2_begin, p2_end),\n",
    "    sentences(\n",
    "        doc_id, sentence_index,\n",
    "        tokens, lemmas, pos_tags, ner_tags, _, dep_types, dep_token_indexes\n",
    "    ).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Python UDF named [`udf/supervise_spouse.py`](https://github.com/HazyResearch/deepdive/blob/master/examples/spouse/udf/supervise_spouse.py) contains several heuristic rules:\n",
    "\n",
    "* Candidates with person mentions that are too far apart in the sentence are marked as false.\n",
    "* Candidates with person mentions that have another person in between are marked as false.\n",
    "* Candidates with person mentions that have words like \"wife\" or \"husband\" in between are marked as true.\n",
    "* Candidates with person mentions that have \"and\" in between and \"married\" after are marked as true.\n",
    "* Candidates with person mentions that have familial relation words in between are marked as false.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting udf/supervise_spouse.py\n"
     ]
    }
   ],
   "source": [
    "%%file udf/supervise_spouse.py\n",
    "#!/usr/bin/env python\n",
    "from deepdive import *\n",
    "import random\n",
    "from collections import namedtuple\n",
    "\n",
    "SpouseLabel = namedtuple('SpouseLabel', 'p1_id, p2_id, label, type')\n",
    "\n",
    "@tsj_extractor\n",
    "@returns(lambda\n",
    "        p1_id   = \"text\",\n",
    "        p2_id   = \"text\",\n",
    "        label   = \"int\",\n",
    "        rule_id = \"text\",\n",
    "    :[])\n",
    "# heuristic rules for finding positive/negative examples of spouse relationship mentions\n",
    "def supervise(\n",
    "        p1_id=\"text\", p1_begin=\"int\", p1_end=\"int\",\n",
    "        p2_id=\"text\", p2_begin=\"int\", p2_end=\"int\",\n",
    "        doc_id=\"text\", sentence_index=\"int\",\n",
    "        tokens=\"text[]\", lemmas=\"text[]\", pos_tags=\"text[]\", ner_tags=\"text[]\",\n",
    "        dep_types=\"text[]\", dep_token_indexes=\"int[]\",\n",
    "    ):\n",
    "\n",
    "    # Constants\n",
    "    MARRIED = frozenset([\"wife\", \"husband\"])\n",
    "    FAMILY = frozenset([\"mother\", \"father\", \"sister\", \"brother\", \"brother-in-law\"])\n",
    "    MAX_DIST = 10\n",
    "\n",
    "    # Common data objects\n",
    "    p1_end_idx = min(p1_end, p2_end)\n",
    "    p2_start_idx = max(p1_begin, p2_begin)\n",
    "    p2_end_idx = max(p1_end,p2_end)\n",
    "    intermediate_lemmas = lemmas[p1_end_idx+1:p2_start_idx]\n",
    "    intermediate_ner_tags = ner_tags[p1_end_idx+1:p2_start_idx]\n",
    "    tail_lemmas = lemmas[p2_end_idx+1:]\n",
    "    spouse = SpouseLabel(p1_id=p1_id, p2_id=p2_id, label=None, type=None)\n",
    "\n",
    "    # Rule: Candidates that are too far apart\n",
    "    if len(intermediate_lemmas) > MAX_DIST:\n",
    "        yield spouse._replace(label=-1, type='neg:far_apart')\n",
    "\n",
    "    # Rule: Candidates that have a third person in between\n",
    "    if 'PERSON' in intermediate_ner_tags:\n",
    "        yield spouse._replace(label=-1, type='neg:third_person_between')\n",
    "\n",
    "    # Rule: Sentences that contain wife/husband in between\n",
    "    #         (<P1>)([ A-Za-z]+)(wife|husband)([ A-Za-z]+)(<P2>)\n",
    "    if len(MARRIED.intersection(intermediate_lemmas)) > 0:\n",
    "        yield spouse._replace(label=1, type='pos:wife_husband_between')\n",
    "\n",
    "    # Rule: Sentences that contain and ... married\n",
    "    #         (<P1>)(and)?(<P2>)([ A-Za-z]+)(married)\n",
    "    if (\"and\" in intermediate_lemmas) and (\"married\" in tail_lemmas):\n",
    "        yield spouse._replace(label=1, type='pos:married_after')\n",
    "\n",
    "    # Rule: Sentences that contain familial relations:\n",
    "    #         (<P1>)([ A-Za-z]+)(brother|stster|father|mother)([ A-Za-z]+)(<P2>)\n",
    "    if len(FAMILY.intersection(intermediate_lemmas)) > 0:\n",
    "        yield spouse._replace(label=-1, type='neg:familial_between')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!chmod +x udf/supervise_spouse.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the rough theory behind this approach is that we don't need high-quality (e.g., hand-labeled) supervision to learn a high quality model.\n",
    "Instead, using statistical learning, we can in fact recover high-quality models from a large set of low-quality or **_noisy_** labels.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3. Resolving multiple labels per example with majority vote\n",
    "Finally, we implement a very simple majority vote procedure, all in DDlog, for resolving scenarios where a single spouse candidate mention has multiple conflicting labels.\n",
    "First, we sum the labels (which are all -1, 0, or 1):\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "# resolve multiple labels by majority vote (summing the labels in {-1,0,1})\n",
    "spouse_label_resolved(p1_id, p2_id, SUM(vote)) :-\n",
    "    spouse_label(p1_id, p2_id, vote, rule_id).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we simply threshold and add these labels to our decision variable table `has_spouse` (see next section for details here):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "# assign the resolved labels for the spouse relation\n",
    "has_spouse(p1_id, p2_id) = if l > 0 then TRUE\n",
    "                      else if l < 0 then FALSE\n",
    "                      else NULL end :- spouse_label_resolved(p1_id, p2_id, l)."
   ]
  },
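  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, the sum-then-threshold logic of the two DDlog rules above can be mimicked in a few lines of plain Python.\n",
    "This is only an illustrative sketch, assuming Python 3: the `resolve_labels` helper and the sample mention IDs are made up for this example and are not part of DeepDive.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "def resolve_labels(labels):\n",
    "    '''Sum the votes per candidate pair, then threshold the sum at zero.'''\n",
    "    totals = defaultdict(int)\n",
    "    for p1_id, p2_id, vote, _rule_id in labels:\n",
    "        totals[(p1_id, p2_id)] += vote\n",
    "    # Positive sum -> True, negative sum -> False, tie -> None (unlabeled)\n",
    "    return {pair: (True if total > 0 else False if total < 0 else None)\n",
    "            for pair, total in totals.items()}\n",
    "\n",
    "# Hypothetical rows of spouse_label: (p1_id, p2_id, vote, rule_id)\n",
    "votes = [\n",
    "    ('m1', 'm2', 1, 'pos:wife_husband_between'),\n",
    "    ('m1', 'm2', 1, 'from_dbpedia'),\n",
    "    ('m1', 'm2', -1, 'neg:far_apart'),\n",
    "    ('m3', 'm4', 1, 'pos:married_after'),\n",
    "    ('m3', 'm4', -1, 'neg:third_person_between'),\n",
    "]\n",
    "print(resolve_labels(votes))\n",
    "```\n",
    "\n",
    "Here the first pair resolves to `True` (two positive votes against one negative), while the second pair is a tie and stays unlabeled (`None`), mirroring the `NULL` case in the DDlog rule."
   ]
  },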
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again, to execute all of the above, just run the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mapp.ddlog: updated since last `deepdive compile`\n",
      "\u001b[0m‘run/compiled’ -> ‘20161105/171617.316682458’\n",
      "‘run/RUNNING’ -> ‘20161105/171618.630553368’\n",
      "2016-11-05 17:16:18.966464 process/ext_spouse_label__0_by_supervise/run.sh\n",
      "2016-11-05 17:16:22.551696 deepdive mark 'done' data/spouse_label__0\n",
      "2016-11-05 17:16:22.588306 process/ext_spouse_label/run.sh\n",
      "2016-11-05 17:16:22.773465 deepdive mark 'done' data/spouse_label\n",
      "2016-11-05 17:16:22.802270 process/ext_spouse_label_resolved/run.sh\n",
      "2016-11-05 17:16:22.988895 deepdive mark 'done' data/spouse_label_resolved\n",
      "2016-11-05 17:16:23.017430 process/ext_has_spouse/run.sh\n",
      "2016-11-05 17:16:23.203629 deepdive mark 'done' data/has_spouse\n",
      "‘run/FINISHED’ -> ‘20161105/171618.630553368’\n"
     ]
    }
   ],
   "source": [
    "!deepdive redo has_spouse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that `deepdive do` will execute all upstream tasks as well, so this will execute all of the previous steps!\n",
    "\n",
    "Now, we can take a brief look at how many candidates are supervised by different rules, which will look something like the table below.\n",
    "Obviously, the counts will vary depending on your input corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           rule           | COUNT(1) \r\n",
      "--------------------------+----------\r\n",
      " neg:familial_between     |       26\r\n",
      " pos:wife_husband_between |       49\r\n",
      " neg:third_person_between |      174\r\n",
      " neg:far_apart            |      239\r\n",
      "                          |      636\r\n",
      "(5 rows)\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!deepdive query 'rule, @order_by COUNT(1) ?- spouse_label(p1,p2, label, rule).'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Learning and inference: model specification\n",
    "Now, we need to specify the actual model that DeepDive will perform learning and inference over.\n",
    "At a high level, this boils down to specifying three things:\n",
    "\n",
    "1. What are the _variables_ of interest that we want DeepDive to predict for us?\n",
    "\n",
    "2. What are the _features_ for each of these variables?\n",
    "\n",
    "3. What are the _connections_ between the variables?\n",
    "\n",
    "One we have specified the model in this way, DeepDive will _learn_ the parameters of the model (the weights of the features and potentially the connections between variables), and then perform _statistical inference_ over the learned model to determine the probability that each variable of interest is true.\n",
    "\n",
    "For more advanced users: we are specifying a _factor graph_ where the features are unary factors, and then using SGD and Gibbs sampling for learning and inference.\n",
    "Further technical detail is available [here](#).\n"
   ]
  },
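  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the role of unary factors a bit more concrete, consider a single Boolean variable in isolation.\n",
    "If we ignore the connections between variables and assume each unary factor contributes its weight only when the variable is true, the probability that the variable is true is the logistic function of the sum of the weights of its active features.\n",
    "The sketch below, assuming Python 3, illustrates just that simplified case; the function name and the weights are hypothetical, and DeepDive itself estimates marginals by Gibbs sampling over the whole factor graph.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def marginal_from_unary_factors(weights):\n",
    "    '''P(variable is true) when only unary factors with these weights apply.'''\n",
    "    total = sum(weights)\n",
    "    # Logistic function of the summed weights\n",
    "    return 1.0 / (1.0 + math.exp(-total))\n",
    "\n",
    "# Hypothetical learned weights of the features active for one candidate\n",
    "print(marginal_from_unary_factors([1.2, -0.4, 0.9]))\n",
    "# With no active features, the two values are equally likely\n",
    "print(marginal_from_unary_factors([]))\n",
    "```\n",
    "\n",
    "Positive weights push the probability above 0.5 and negative weights push it below, which is why features that co-occur with positive labels end up raising the expectation of similar candidates."
   ]
  },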
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1. Specifying prediction variables\n",
    "In our case, we have one variable to predict per spouse candidate mention, namely, **is this mention actually indicating a spousal relation or not?**\n",
    "In other words, we want DeepDive to predict the value of a Boolean variable for each spouse candidate mention, indicating whether it is true or not.\n",
    "Recall that we started this tutorial with specifying this at the beginning of [`app.ddlog`](app.ddlog) as follows:\n",
    "\n",
    "```ddlog\n",
    "has_spouse?(\n",
    "    p1_id text,\n",
    "    p2_id text\n",
    ").\n",
    "```\n",
    "\n",
    "DeepDive will predict not only the value of these variables, but also the marginal probabilities, i.e., the confidence level that DeepDive has for each individual prediction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2. Specifying features\n",
    "Next, we indicate (i) that each `has_spouse` variable will be connected to the features of the corresponding `spouse_candidate` row, (ii) that we wish DeepDive to learn the weights of these features from our distantly supervised data, and (iii) that the weight of a specific feature across all instances should be the same, as follows:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "## Inference Rules ############################################################\n",
    " \n",
    "# Features\n",
    "@weight(f)\n",
    "has_spouse(p1_id, p2_id) :-\n",
    "    spouse_feature(p1_id, p2_id, f).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3. Specifying connections between variables\n",
    "Finally, we can specify dependencies between the prediction variables, with either learned or given weights.\n",
    "Here, we'll specify two such rules, with fixed (given) weights that we specify.\n",
    "First, we define a _symmetry_ connection, namely specifying that if the model thinks a person mention `p1` and a person mention `p2` indicate a spousal relationship in a sentence, then it should also think that the reverse is true, i.e., that `p2` and `p1` indicate one too:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "# Inference rule: Symmetry\n",
    "@weight(3.0)\n",
    "has_spouse(p1_id, p2_id) => has_spouse(p2_id, p1_id) :-\n",
    "    TRUE.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we specify a rule that the model should be strongly biased towards finding one marriage indication per person mention.\n",
    "We do this inversely, using a negative weight, as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Appending to app.ddlog\n"
     ]
    }
   ],
   "source": [
    "%%file -a app.ddlog\n",
    "\n",
    "# Inference rule: Only one marriage\n",
    "@weight(-1.0)\n",
    "has_spouse(p1_id, p2_id) => has_spouse(p1_id, p3_id) :-\n",
    "    TRUE.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.4. Performing learning and inference\n",
    "\n",
    "Finally, to perform learning and inference using the specified model, we need to run the following command:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mapp.ddlog: updated since last `deepdive compile`\n",
      "\u001b[0m‘run/compiled’ -> ‘20161105/171629.260084208’\n",
      "‘run/RUNNING’ -> ‘20161105/171630.704470831’\n",
      "2016-11-05 17:16:31.297709 process/grounding/from_grounding/run.sh\n",
      "2016-11-05 17:16:31.312432 process/grounding/variable/has_spouse/materialize/run.sh\n",
      "2016-11-05 17:16:36.389875 process/grounding/variable_assign_id/run.sh\n",
      "2016-11-05 17:16:36.673670 process/grounding/factor/inf_imply_has_spouse_has_spouse_0/materialize/run.sh\n",
      "2016-11-05 17:16:42.015477 process/grounding/factor/inf_imply_has_spouse_has_spouse_1/materialize/run.sh\n",
      "2016-11-05 17:16:52.138722 process/grounding/factor/inf_istrue_has_spouse/materialize/run.sh\n",
      "2016-11-05 17:16:59.239468 process/grounding/assign_weight_id/run.sh\n",
      "2016-11-05 17:16:59.887608 process/grounding/factor/inf_imply_has_spouse_has_spouse_0/0/dump/run.sh\n",
      "2016-11-05 17:17:01.095601 process/grounding/factor/inf_imply_has_spouse_has_spouse_0/dump_weights/run.sh\n",
      "2016-11-05 17:17:02.291527 process/grounding/factor/inf_imply_has_spouse_has_spouse_1/0/dump/run.sh\n",
      "2016-11-05 17:17:03.516167 process/grounding/factor/inf_imply_has_spouse_has_spouse_1/dump_weights/run.sh\n",
      "2016-11-05 17:17:04.702877 process/grounding/factor/inf_istrue_has_spouse/0/dump/run.sh\n",
      "2016-11-05 17:17:05.028324 process/grounding/factor/inf_istrue_has_spouse/dump_weights/run.sh\n",
      "2016-11-05 17:17:05.287317 process/grounding/global_weight_table/run.sh\n",
      "2016-11-05 17:17:05.470889 process/grounding/variable_holdout/run.sh\n",
      "2016-11-05 17:17:06.148270 process/grounding/variable/has_spouse/0/dump/run.sh\n",
      "2016-11-05 17:17:07.342957 process/grounding/combine_factorgraph/run.sh\n",
      "2016-11-05 17:17:07.424862 process/model/learning/run.sh\n",
      "2016-11-05 17:17:09.555646 process/model/inference/run.sh\n",
      "2016-11-05 17:17:09.592518 process/model/load_probabilities/run.sh\n",
      "2016-11-05 17:17:10.366123 deepdive mark 'done' data/model/probabilities\n",
      "‘run/FINISHED’ -> ‘20161105/171630.704470831’\n"
     ]
    }
   ],
   "source": [
    "!deepdive redo probabilities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will ground the model based on the data in the database, learn the weights, infer the expectations or marginal probabilities of the variables in the model, and then load them back to the database.\n",
    "\n",
    "Let's take a look at the probabilities inferred by DeepDive for the `has_spouse` variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                     p1_id                     |                     p2_id                     | expectation \r\n",
      "-----------------------------------------------+-----------------------------------------------+-------------\r\n",
      " 8b31ede3-0f3b-431a-86a3-342ee18cfd83_10_27_27 | 8b31ede3-0f3b-431a-86a3-342ee18cfd83_10_5_5   |           0\r\n",
      " acedaa54-9820-4b71-aa7b-38dc7ed1d2a6_0_35_35  | acedaa54-9820-4b71-aa7b-38dc7ed1d2a6_0_37_38  |       0.032\r\n",
      " 328623e0-52f3-44a6-b66b-496cd9d93762_3_1_1    | 328623e0-52f3-44a6-b66b-496cd9d93762_3_23_24  |       0.008\r\n",
      " c27a162d-f2d1-4bdb-84ba-0915a082775b_32_21_21 | c27a162d-f2d1-4bdb-84ba-0915a082775b_32_31_31 |       0.019\r\n",
      " f6e047d0-e409-42a6-ab0e-13ab926719a6_19_24_25 | f6e047d0-e409-42a6-ab0e-13ab926719a6_19_31_32 |       0.015\r\n",
      " 172960c6-cb26-4cd1-99a8-d7cb92f8dec8_29_15_15 | 172960c6-cb26-4cd1-99a8-d7cb92f8dec8_29_7_8   |       0.034\r\n",
      " 9662058b-fca5-4771-8058-c7fd7bd548a3_3_0_1    | 9662058b-fca5-4771-8058-c7fd7bd548a3_3_17_18  |        0.03\r\n",
      " 693ae030-4239-4291-b248-dbf7c1696ff2_4_15_15  | 693ae030-4239-4291-b248-dbf7c1696ff2_4_2_2    |           0\r\n",
      " eacc9625-b22d-4a44-a62e-7d53c132af1a_14_0_0   | eacc9625-b22d-4a44-a62e-7d53c132af1a_14_14_15 |        0.01\r\n",
      " dbc798be-9a6e-48b7-8721-31f84e89c10b_27_15_15 | dbc798be-9a6e-48b7-8721-31f84e89c10b_27_2_2   |       0.007\r\n",
      " 18658e4a-a94e-478f-ab2e-2ee709bd47e5_8_11_11  | 18658e4a-a94e-478f-ab2e-2ee709bd47e5_8_21_22  |       0.027\r\n",
      " b4968e78-ec5a-466e-863f-fef18e8ae99d_34_33_33 | b4968e78-ec5a-466e-863f-fef18e8ae99d_34_39_39 |        0.01\r\n",
      " 6779b9e1-073a-4adb-a20d-7d11c61410c9_1_0_0    | 6779b9e1-073a-4adb-a20d-7d11c61410c9_1_6_6    |       0.006\r\n",
      " 7e5f4072-b69f-4819-8ed6-62bdd0100621_13_14_15 | 7e5f4072-b69f-4819-8ed6-62bdd0100621_13_21_22 |       0.007\r\n",
      " acedaa54-9820-4b71-aa7b-38dc7ed1d2a6_1_12_12  | acedaa54-9820-4b71-aa7b-38dc7ed1d2a6_1_48_48  |       0.008\r\n",
      " 9662058b-fca5-4771-8058-c7fd7bd548a3_34_0_0   | 9662058b-fca5-4771-8058-c7fd7bd548a3_34_6_7   |       0.023\r\n",
      " 23490793-bb60-44c0-bbec-9c3be871d762_15_17_18 | 23490793-bb60-44c0-bbec-9c3be871d762_15_21_22 |       0.036\r\n",
      " d6880afb-7fcb-4576-9d17-cedd343677f9_29_0_0   | d6880afb-7fcb-4576-9d17-cedd343677f9_29_20_20 |       0.008\r\n",
      " c27a162d-f2d1-4bdb-84ba-0915a082775b_19_26_26 | c27a162d-f2d1-4bdb-84ba-0915a082775b_19_5_5   |           0\r\n",
      " 0a74a914-54fb-47bc-acae-5dcd10ed5c3d_5_25_25  | 0a74a914-54fb-47bc-acae-5dcd10ed5c3d_5_3_3    |           0\r\n",
      "(20 rows)\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!deepdive sql 'SELECT p1_id, p2_id, expectation FROM has_spouse_inference ORDER BY random() LIMIT 20'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 4. Error analysis and debugging\n",
    "\n",
    "After finishing a pass of writing and running the DeepDive application, the first thing we want to see is how good the results are.\n",
    "In this section, we describe how DeepDive's interactive tools can be used for viewing the results as well as error analysis and debugging."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1. Calibration Plots\n",
    "\n",
    "DeepDive provides *calibration plots* to see how well the expectations computed by the system are calibrated.\n",
    "The following command generates a plot for each variable under `run/model/calibration-plots/`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!deepdive do calibration-plots"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It will produce a file `run/model/calibration-plots/has_spouse.png` that holds three plots as shown below:\n",
    "![Calibration plot for spouse example](run/model/calibration-plots/has_spouse.png)\n",
    "\n",
    "Refer to the [full documentation on calibration data](calibration.md) for more detail on how to interpret the plots and take actions."
   ]
  },
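  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what a calibration plot summarizes, the sketch below groups predictions into equal-width probability buckets and reports the fraction of true examples in each.\n",
    "It assumes Python 3, and the `calibration_buckets` helper and the sample data are hypothetical; this is not how DeepDive produces its plots internally.\n",
    "\n",
    "```python\n",
    "def calibration_buckets(predictions, num_buckets=10):\n",
    "    '''Group (expectation, is_true) pairs into equal-width probability buckets.'''\n",
    "    buckets = [[0, 0] for _ in range(num_buckets)]  # [num_true, num_total]\n",
    "    for expectation, is_true in predictions:\n",
    "        # Clamp so that an expectation of exactly 1.0 falls in the last bucket\n",
    "        i = min(int(expectation * num_buckets), num_buckets - 1)\n",
    "        buckets[i][0] += 1 if is_true else 0\n",
    "        buckets[i][1] += 1\n",
    "    rows = []\n",
    "    for i, (num_true, num_total) in enumerate(buckets):\n",
    "        if num_total:  # skip empty buckets\n",
    "            low = i / float(num_buckets)\n",
    "            high = (i + 1) / float(num_buckets)\n",
    "            rows.append((low, high, num_true / float(num_total), num_total))\n",
    "    return rows\n",
    "\n",
    "# Hypothetical held-out examples: (expectation, ground-truth label)\n",
    "sample = [(0.05, False), (0.08, False), (0.93, True), (0.97, True), (0.95, False)]\n",
    "for low, high, accuracy, count in calibration_buckets(sample):\n",
    "    print(low, high, accuracy, count)\n",
    "```\n",
    "\n",
    "For a well-calibrated model, the fraction of true examples in each bucket should be close to the bucket's probability range."
   ]
  },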
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2. Browsing data with Mindbender\n",
    "\n",
    "*Mindbender* is the name of the tool that provides an interactive user interface to DeepDive.\n",
    "It can be used for browsing any data that has been loaded into DeepDive and produced by it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Browsing input corpus\n",
    "\n",
    "We need to give hints to DeepDive about which part of the data we want to browse [using DDlog's annotation](http://deepdive.stanford.edu/browsing#ddlog-annotations-for-browsing).\n",
    "For example, on the `articles` relation we declared earlier in `app.ddlog`, we can sprinkle some annotations such as `@source`, `@key`, and `@searchable`, as the following.\n",
    "\n",
    "\n",
    "```ddlog\n",
    "@source\n",
    "articles(\n",
    "    @key\n",
    "    id text,\n",
    "    @searchable\n",
    "    content text\n",
    ").\n",
    "```\n",
    "\n",
    "The fully annotated DDlog code is available at GitHub and can be downloaded to replace your `app.ddlog` by running the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!curl -RLO \"https://github.com/HazyResearch/deepdive/raw/master/examples/spouse/app.ddlog\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, if we run the following command, DeepDive will create and populate a search index according to these hints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!mindbender search drop; mindbender search update"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To access the populated search index through a web browser, run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!mindbender search gui"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, point your browser to the URL that appears after the command (typically <http://localhost:8000>) to see a view that looks like the following:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Screenshot of the search interface showing input corpus](https://github.com/HazyResearch/deepdive/raw/master/doc/images/browsing_corpus.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Browsing result data\n",
    "\n",
    "To browse the results, we can add annotations to the inferred relations and how they relate to their source relations.\n",
    "For example, the `@extraction` and `@references` annotations in the following DDlog declaration tells DeepDive that the variable relation `has_spouse` is inferred from pairs of `person_mention`.\n",
    "\n",
    "```ddlog\n",
    "@extraction\n",
    "has_spouse?(\n",
    "    @key\n",
    "    @references(relation=\"person_mention\", column=\"mention_id\", alias=\"p1\")\n",
    "    p1_id text,\n",
    "    @key\n",
    "    @references(relation=\"person_mention\", column=\"mention_id\", alias=\"p2\")\n",
    "    p2_id text\n",
    ").\n",
    "```\n",
    "\n",
    "The relation `person_mention` as well as the relations it references should have similar annotations (see the [complete `app.ddlog` code](../examples/spouse/app.ddlog) for full detail).\n",
    "\n",
    "Then, repeating the commands to update the search index and load the user interface will allow us to browse the expected marginal probabilities of `has_spouse` as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Screenshot of the search interface showing results](https://github.com/HazyResearch/deepdive/raw/master/doc/images/browsing_results.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Customizing how data is presented\n",
    "\n",
    "<!-- TODO describe presentation annotations once it's ready -->\n",
    "\n",
    "In fact, the screenshots above are showing the data presented using a [carefully prepared set of templates under `mindbender/search-templates/`](https://github.com/HazyResearch/deepdive/tree/master/examples/spouse/mindbender/search-template/).\n",
    "In these AngularJS templates, virtually anything you can program in HTML/CSS/JavaScript/CoffeeScript can be added to present the data that is ideal for human consumption (e.g., highlighted text spans rather than token indexes).\n",
    "Please see the [documentation about customizing the presentation](http://deepdive.stanford.edu/browsing#customizing-presentation) for further detail."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3. Estimating precision with Mindtagger\n",
    "\n",
    "*Mindtagger*, which is part of the Mindbender tool suite, assists data labeling tasks to quickly assess the precision and/or recall of the extraction.\n",
    "We show how Mindtagger helps us perform a labeling task to estimate the precision of the extraction.\n",
    "The necessary set of files shown below already exist [in the example under `labeling/has_spouse-precision/`](https://github.com/HazyResearch/deepdive/tree/master/examples/spouse/labeling/has_spouse-precision/).\n",
    "\n",
    "<!-- TODO describe how a task can be created from the search interface instead, once it's ready -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Preparing a data labeling task\n",
    "\n",
    "First, we can take a random sample of 100 examples from `has_spouse` relation whose expectation is higher than or equal to a 0.9 threshold as shown in [the following SQL query](../examples/spouse/labeling/has_spouse-precision/sample-has_spouse.sql), and store them in [a file called `has_spouse.csv`](../examples/spouse/labeling/has_spouse-precision/has_spouse.csv).\n",
    "\n",
    "<!-- TODO use deepdive-query instead once it allows the @expectation syntax to grab such field for variable relations -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!mkdir -p labeling/has_spouse-precision/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "deepdive sql eval \"\n",
    "\n",
    "SELECT hsi.p1_id\n",
    "     , hsi.p2_id\n",
    "     , s.doc_id\n",
    "     , s.sentence_index\n",
    "     , hsi.dd_label\n",
    "     , hsi.expectation\n",
    "     , s.tokens\n",
    "     , pm1.mention_text AS p1_text\n",
    "     , pm1.begin_index  AS p1_start\n",
    "     , pm1.end_index    AS p1_end\n",
    "     , pm2.mention_text AS p2_text\n",
    "     , pm2.begin_index  AS p2_start\n",
    "     , pm2.end_index    AS p2_end\n",
    "\n",
    "  FROM has_spouse_inference hsi\n",
    "     , person_mention             pm1\n",
    "     , person_mention             pm2\n",
    "     , sentences                  s\n",
    "\n",
    " WHERE hsi.p1_id          = pm1.mention_id\n",
    "   AND pm1.doc_id         = s.doc_id\n",
    "   AND pm1.sentence_index = s.sentence_index\n",
    "   AND hsi.p2_id          = pm2.mention_id\n",
    "   AND pm2.doc_id         = s.doc_id\n",
    "   AND pm2.sentence_index = s.sentence_index\n",
    "   AND       expectation >= 0.9\n",
    "\n",
    " ORDER BY random()\n",
    " LIMIT 100\n",
    "\n",
    "\" format=csv header=1 >labeling/has_spouse-precision/has_spouse.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also prepare the [`mindtagger.conf`](https://github.com/HazyResearch/deepdive/blob/master/examples/spouse/labeling/has_spouse-precision/mindtagger.conf) and [`template.html`](https://github.com/HazyResearch/deepdive/blob/master/examples/spouse/labeling/has_spouse-precision/template.html) files under [`labeling/has_spouse-precision/`](https://github.com/HazyResearch/deepdive/blob/master/examples/spouse/labeling/has_spouse-precision/) that look like the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%file labeling/has_spouse-precision/mindtagger.conf\n",
    "title: Labeling task for estimating has_spouse precision\n",
    "items: {\n",
    "    file: has_spouse.csv\n",
    "    key_columns: [p1_id, p2_id]\n",
    "}\n",
    "template: template.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%file labeling/has_spouse-precision/template.html\n",
    "<mindtagger mode=\"precision\">\n",
    "\n",
    "  <template for=\"each-item\">\n",
    "    <strong title=\"item_id: {{item.id}}\">{{item.p1_text}} -- {{item.p2_text}}</strong>\n",
    "    with expectation <strong>{{item.expectation | number:3}}</strong> appeared in:\n",
    "    <blockquote>\n",
    "        <big mindtagger-word-array=\"item.tokens\" array-format=\"json\">\n",
    "            <mindtagger-highlight-words from=\"item.p1_start\" to=\"item.p1_end\" with-style=\"background-color: yellow;\"/>\n",
    "            <mindtagger-highlight-words from=\"item.p2_start\" to=\"item.p2_end\" with-style=\"background-color: cyan;\"/>\n",
    "        </big>\n",
    "    </blockquote>\n",
    "\n",
    "    <div>\n",
    "      <div mindtagger-item-details></div>\n",
    "    </div>\n",
    "  </template>\n",
    "\n",
    "  <template for=\"tags\">\n",
    "    <span mindtagger-adhoc-tags></span>\n",
    "    <span mindtagger-note-tags></span>\n",
    "  </template>\n",
    "\n",
    "</mindtagger>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Labeling data with Mindtagger\n",
    "\n",
    "Mindtagger can then be started for the task using the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!mindbender tagger labeling/has_spouse-precision/mindtagger.conf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, point your browser to the URL that appears after the command (typically <http://localhost:8000>) to see a dedicated user interface for labeling data that looks like the following:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Screenshot of the labeling interface showing the sampled data](https://github.com/HazyResearch/deepdive/raw/master/doc/images/mindtagger_screenshot.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can quickly label the sampled 100 examples using the intuitive user interface with buttons for correct/incorrect tags.\n",
    "It also supports keyboard shortcuts for entering labels and moving between items.\n",
    "(Press the <kbd>?</kbd> key to view all supported keys.)\n",
    "How many were labeled correct, as well as other tags, are shown in the \"Tags\" dropdown at the top right corner as shown below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Screenshot of the labeling interface showing tag statistics](https://github.com/HazyResearch/deepdive/raw/master/doc/images/mindtagger_screenshot_tags.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The collected tags can also be exported in various format for post-processing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Screenshot of the labeling interface for exporting tags](https://github.com/HazyResearch/deepdive/raw/master/doc/images/mindtagger_screenshot_export.png)"
   ]
  },
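  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of such post-processing, the Python snippet below estimates precision from exported tags.\n",
    "It assumes, without confirmation from the Mindtagger documentation, that the export is a CSV with one row per item and an `is_correct` column holding `true` or `false`.\n",
    "The column name, the helper names `estimate_precision` and `load_rows`, and the inline sample rows are hypothetical and only for illustration.\n",
    "\n",
    "```python\n",
    "import csv\n",
    "\n",
    "\n",
    "def estimate_precision(rows, tag='is_correct'):\n",
    "    # Keep only items that actually received a label for this tag.\n",
    "    labeled = [(r.get(tag) or '').strip().lower() for r in rows]\n",
    "    labeled = [value for value in labeled if value]\n",
    "    if not labeled:\n",
    "        return None  # nothing labeled yet, so no estimate\n",
    "    correct = sum(1 for value in labeled if value == 'true')\n",
    "    return correct / len(labeled)\n",
    "\n",
    "\n",
    "def load_rows(path):\n",
    "    # Read an exported CSV file (assumed layout) into a list of dicts.\n",
    "    with open(path, newline='') as f:\n",
    "        return list(csv.DictReader(f))\n",
    "\n",
    "\n",
    "# Inline stand-in for exported rows; with a real export, use load_rows(path).\n",
    "rows = [\n",
    "    {'p1_id': 'a', 'p2_id': 'b', 'is_correct': 'true'},\n",
    "    {'p1_id': 'c', 'p2_id': 'd', 'is_correct': 'false'},\n",
    "    {'p1_id': 'e', 'p2_id': 'f', 'is_correct': 'true'},\n",
    "    {'p1_id': 'g', 'p2_id': 'h', 'is_correct': ''},\n",
    "]\n",
    "print(estimate_precision(rows))\n",
    "```\n",
    "\n",
    "Items left without a tag are skipped, so the estimate reflects only the examples that were actually labeled."
   ]
  },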
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For further detail, see the [documentation about labeling data](http://deepdive.stanford.edu/labeling)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.4. Monitoring statistics with Dashboard\n",
    "\n",
    "<!-- TODO introduce how dashboard reports can be created from search context and their values be tracked in trends, once mindbender is updated -->\n",
    "\n",
    "*Dashboard* provides a way to monitor various descriptive statistics of the data products after each pass of DeepDive improvements.\n",
    "We can use a combination of SQL, any Bash script, and Markdown in each *report template* that produces a *report*, and we can produce a collection of them as a *snapshot* against the data extracted by DeepDive.\n",
    "Dashboard provides a structure to manage those templates and instantiate them in a sophisticated way using parameters.\n",
    "It provides a graphical interface for visualizing the collected statistics and trends as shown below.\n",
    "Refer to the [full documentation on Dashboard](http://deepdive.stanford.edu/dashboard) to set up your own set of reports.\n",
    "\n",
    "<!-- TODO write about setting up some basic example snapshot config / report templates for spouse example -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Screenshot of Dashboard Reports](https://github.com/HazyResearch/deepdive/raw/master/doc/images/dashboard/supervision_report.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Screenshot of Dashboard Trends](https://github.com/HazyResearch/deepdive/raw/master/doc/images/dashboard/homepage.png)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
